37 datasets found
  1. Mean absolute percentage errors of gold loss for error variances of 1, 3, 5,...

    • plos.figshare.com
    xls
    Updated Mar 17, 2025
    Cite
    Juthaphorn Sinsomboonthong; Saichon Sinsomboonthong (2025). Mean absolute percentage errors of gold loss for error variances of 1, 3, 5, 7, and 9; a sample sizes of 48; and missing value percentages of 5, 10, 15, 20, 30 and 40% with SRS in missing value estimation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0313772.t023
    Available download formats
    xls
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Juthaphorn Sinsomboonthong; Saichon Sinsomboonthong
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mean absolute percentage errors of gold loss for error variances of 1, 3, 5, 7, and 9; a sample sizes of 48; and missing value percentages of 5, 10, 15, 20, 30 and 40% with SRS in missing value estimation methods.
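
    For reference, a mean absolute percentage error (MAPE) is conventionally the mean of |actual - estimate| / |actual|, expressed in percent. The following is a minimal sketch assuming that standard definition; the function name `mape` and the sample values are hypothetical and are not taken from the dataset.

```python
def mape(actual, estimated):
    """Mean absolute percentage error, in percent.

    Assumes the conventional definition; every actual value must be non-zero.
    """
    if len(actual) != len(estimated) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - e) / a) for a, e in zip(actual, estimated))
    return 100.0 * total / len(actual)


# Hypothetical true values and the estimates produced for them.
true_values = [10.0, 20.0, 40.0]
imputed = [11.0, 18.0, 40.0]
print(mape(true_values, imputed))
```

    A lower value means the estimated values sit closer to the true ones, which is how the tables in this series compare estimation methods.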

  2. Imputation missing values in the nominal datasets

    • kaggle.com
    zip
    Updated Jan 29, 2023
    Cite
    Awsan thabet salem (2023). Imputation missing values in the nominal datasets [Dataset]. https://www.kaggle.com/datasets/awsanthabetsalem/imputation-in-arabic-dataset/data
    Available download formats
    zip (16588335 bytes)
    Dataset updated
    Jan 29, 2023
    Authors
    Awsan thabet salem
    Description

    The folder contains three datasets: Zomato restaurants, Restaurants on Yellow Pages, and Arabic poetry. All three were taken from Kaggle and modified by adding missing values, which are marked with the symbol (?). The datasets were prepared to experiment with imputing missing values in nominal data. The proportion of missing values in the three datasets ranges from 10% to 80%.
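
    As a rough illustration of working with the (?) marker, the sketch below reads nominal data and fills each missing entry with the column's most frequent value. This is plain mode imputation, a simple baseline swapped in for illustration, not the ERAR method the dataset was built to evaluate. The column names and rows are hypothetical.

```python
import csv
import io
from collections import Counter

MISSING = "?"  # missing-value marker described for these datasets

# Hypothetical nominal data; the real files have different columns.
raw = """city,cuisine
Sanaa,Arabic
Aden,?
Sanaa,Arabic
?,Italian
"""

rows = list(csv.DictReader(io.StringIO(raw)))


def mode_impute(rows, column):
    """Replace MISSING in one column with that column's most frequent observed value."""
    observed = [r[column] for r in rows if r[column] != MISSING]
    most_common = Counter(observed).most_common(1)[0][0]
    for r in rows:
        if r[column] == MISSING:
            r[column] = most_common


for col in ("city", "cuisine"):
    mode_impute(rows, col)

print(rows)
```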

    The Arabic dataset was modified as follows: 1. Columns containing English values, such as Id, poem_link, and poet link, were deleted, because the ERAR method needed to be evaluated on Arabic data. 2. Diacritical marks were added to some records to check their effect during frequent itemset generation. Note: the results of the experiment on the Arabic dataset can be found in the paper titled "Missing values imputation in Arabic datasets using enhanced robust association rules".

  3. Mean absolute percentage errors for the scenario where error variance was 9;...

    • plos.figshare.com
    xls
    Updated Mar 17, 2025
    Cite
    Juthaphorn Sinsomboonthong; Saichon Sinsomboonthong (2025). Mean absolute percentage errors for the scenario where error variance was 9; sample sizes of 20, 40, 60, 80, 100, 120, 200, and 500; and missing value percentages of 5, 10, 15, 20, 30, and 40% with RSS in missing value estimation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0313772.t021
    Available download formats
    xls
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Juthaphorn Sinsomboonthong; Saichon Sinsomboonthong
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mean absolute percentage errors for the scenario where error variance was 9; sample sizes of 20, 40, 60, 80, 100, 120, 200, and 500; and missing value percentages of 5, 10, 15, 20, 30, and 40% with RSS in missing value estimation methods.

  4. QLFS

    • datacatalogue.ukdataservice.ac.uk
    Updated Sep 16, 2025
    Cite
    Office for National Statistics (2025). QLFS [Dataset]. http://doi.org/10.5255/UKDA-SN-9445-1
    Dataset updated
    Sep 16, 2025
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Office for National Statistics
    Area covered
    United Kingdom
    Description
    Background
    The Labour Force Survey (LFS) is a unique source of information using international definitions of employment and unemployment and economic inactivity, together with a wide range of related topics such as occupation, training, hours of work and personal characteristics of household members aged 16 years and over. It is used to inform social, economic and employment policy. The LFS was first conducted biennially from 1973-1983. Between 1984 and 1991 the survey was carried out annually and consisted of a quarterly survey conducted throughout the year and a 'boost' survey in the spring quarter (data were then collected seasonally). From 1992 quarterly data were made available, with a quarterly sample size approximately equivalent to that of the previous annual data. The survey then became known as the Quarterly Labour Force Survey (QLFS). From December 1994, data gathering for Northern Ireland moved to a full quarterly cycle to match the rest of the country, so the QLFS then covered the whole of the UK (though some additional annual Northern Ireland LFS datasets are also held at the UK Data Archive). Further information on the background to the QLFS may be found in the documentation.

    Household datasets
    Up to 2015, the LFS household datasets were produced twice a year (April-June and October-December) from the corresponding quarter's individual-level data. From January 2015 onwards, they are now produced each quarter alongside the main QLFS. The household datasets include all the usual variables found in the individual-level datasets, with the exception of those relating to income, and are intended to facilitate the analysis of the economic activity patterns of whole households. It is recommended that the existing individual-level LFS datasets continue to be used for any analysis at individual level, and that the LFS household datasets be used for analysis involving household or family-level data. From January 2011, a pseudonymised household identifier variable (HSERIALP) is also included in the main quarterly LFS dataset instead.

    Change to coding of missing values for household series
    From 1996-2013, all missing values in the household datasets were set to one '-10' category instead of the separate '-8' and '-9' categories. For that period, the ONS introduced a new imputation process for the LFS household datasets and it was necessary to code the missing values into one new combined category ('-10'), to avoid over-complication. This was also in line with the Annual Population Survey household series of the time. The change was applied to the back series during 2010 to ensure continuity for analytical purposes. From 2013 onwards, the -8 and -9 categories have been reinstated.
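
    Because the missing-value codes differ by period ('-10' for the recoded series, '-8' and '-9' otherwise), an analysis that spans periods may want to map all three to a single missing marker first. A minimal sketch, assuming the codes are stored as integers; the function name and sample column are hypothetical.

```python
# Codes named in the text above; which code applies depends on the period.
LFS_MISSING_CODES = {-8, -9, -10}


def recode_missing(values):
    """Return a copy of `values` with LFS missing-value codes replaced by None."""
    return [None if v in LFS_MISSING_CODES else v for v in values]


# Hypothetical column of coded responses.
coded = [1, 2, -8, 3, -10, -9, 2]
clean = recode_missing(coded)
print(clean)
```

    Collapsing the codes discards the distinction between '-8' and '-9', so this only suits analyses that do not need it.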

    LFS Documentation
    The documentation available from the Archive to accompany LFS datasets largely consists of the latest version of each volume alongside the appropriate questionnaire for the year concerned. However, LFS volumes are updated periodically by ONS, so users are advised to check the ONS LFS User Guidance page before commencing analysis.

    Additional data derived from the QLFS
    The Archive also holds further QLFS series: End User Licence (EUL) quarterly datasets; Secure Access datasets (see below); two-quarter and five-quarter longitudinal datasets; quarterly, annual and ad hoc module datasets compiled for Eurostat; and some additional annual Northern Ireland datasets.

    End User Licence and Secure Access QLFS Household datasets
    Users should note that there are two discrete versions of the QLFS household datasets. One is available under the standard End User Licence (EUL) agreement, and the other is a Secure Access version. Secure Access household datasets for the QLFS are available from 2009 onwards, and include additional, detailed variables not included in the standard EUL versions. Extra variables that typically can be found in the Secure Access versions but not in the EUL versions relate to: geography; date of birth, including day; education and training; household and family characteristics; employment; unemployment and job hunting; accidents at work and work-related health problems; nationality, national identity and country of birth; occurrence of learning difficulty or disability; and benefits. For full details of variables included, see data dictionary documentation. The Secure Access version (see SN 7674) has more restrictive access conditions than those made available under the standard EUL. Prospective users will need to gain ONS Accredited Researcher status, complete an extra application form and demonstrate to the data owners exactly why they need access to the additional variables. Users are strongly advised to first obtain the standard EUL version of the data to see if they are sufficient for their research requirements.

    Changes to variables in QLFS Household EUL datasets
    In order to further protect respondent confidentiality, ONS have made some changes to variables available in the EUL datasets. From July-September 2015 onwards, 4-digit industry class is available for main job only, meaning that 3-digit industry group is the most detailed level available for second and last job.

    Review of imputation methods for LFS Household data - changes to missing values
    A review of the imputation methods used in LFS Household and Family analysis resulted in a change from the January-March 2015 quarter onwards. It was no longer considered appropriate to impute any personal characteristic variables (e.g. religion, ethnicity, country of birth, nationality, national identity, etc.) using the LFS donor imputation method. This method focuses primarily on ensuring that the 'economic status' of all individuals within a household is known, allowing analysis of the combined economic status of households. This means that from 2015 larger amounts of missing values ('-8'/'-9') will be present in the data for these personal characteristic variables than before. Therefore, users who need to carry out time series analysis of households/families that also includes personal characteristic variables covering this time period are advised to filter out 'ioutcome=3' cases from all periods to remove this inconsistent treatment of non-responders.
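
    The recommended filter can be sketched as follows. Only the variable name `ioutcome` and the value 3 come from the guidance above; the record layout, the other field names, and the function name are hypothetical.

```python
# Hypothetical person-level records; only `ioutcome` is named in the text.
records = [
    {"person_id": 1, "ioutcome": 1, "ethnicity": 2},
    {"person_id": 2, "ioutcome": 3, "ethnicity": -9},
    {"person_id": 3, "ioutcome": 2, "ethnicity": 1},
]


def drop_ioutcome_3(records):
    """Keep only records whose ioutcome is not 3."""
    return [r for r in records if r["ioutcome"] != 3]


kept = drop_ioutcome_3(records)
print([r["person_id"] for r in kept])
```

    For a consistent time series, the same filter would be applied to every quarter in the analysis, not only those from 2015 onwards.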

    Occupation data for 2021 and 2022 data files

    The ONS has identified an issue with the collection of some occupational data in 2021 and 2022 data files in a number of their surveys. While they estimate any impacts will be small overall, this will affect the accuracy of the breakdowns of some detailed (four-digit Standard Occupational Classification (SOC)) occupations, and data derived from them. Further information can be found in the ONS article published on 11 July 2023: Revision of miscoded occupational data in the ONS Labour Force Survey, UK: January 2021 to September 2022 (https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/articles/revisionofmiscodedoccupationaldataintheonslabourforcesurveyuk/january2021toseptember2022).

  5. Mean absolute percentage errors of platinum loss for the scenarios with...

    • plos.figshare.com
    xls
    Updated Mar 17, 2025
    Cite
    Juthaphorn Sinsomboonthong; Saichon Sinsomboonthong (2025). Mean absolute percentage errors of platinum loss for the scenarios with error variances of 1, 3, 5, 7, and 9; a sample sizes of 48; and missing value percentages of 5, 10, 15, 20, 30, and 40% with SRS in missing value estimation methods. [Dataset]. http://doi.org/10.1371/journal.pone.0313772.t027
    Available download formats
    xls
    Dataset updated
    Mar 17, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Juthaphorn Sinsomboonthong; Saichon Sinsomboonthong
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Mean absolute percentage errors of platinum loss for the scenarios with error variances of 1, 3, 5, 7, and 9; a sample sizes of 48; and missing value percentages of 5, 10, 15, 20, 30, and 40% with SRS in missing value estimation methods.

  6. Data from: A Bayesian Approach to Parameter Estimation in the Presence of...

    • tandf.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Domenica Panzera; Roberto Benedetti; Paolo Postiglione (2023). A Bayesian Approach to Parameter Estimation in the Presence of Spatial Missing Data [Dataset]. http://doi.org/10.6084/m9.figshare.3688599.v1
    Available download formats
    txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Domenica Panzera; Roberto Benedetti; Paolo Postiglione
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The missing data problem has been widely addressed in the literature. The traditional methods for handling missing data may not be suited to spatial data, which can exhibit distinctive structures of dependence and/or heterogeneity. As a possible solution to the spatial missing data problem, this paper proposes an approach that combines the Bayesian Interpolation method [Benedetti, R. & Palma, D. (1994) Markov random field-based image subsampling method, Journal of Applied Statistics, 21(5), 495–509] with a multiple imputation procedure. The method is developed in a univariate and a multivariate framework, and its performance is evaluated through an empirical illustration based on data related to labour productivity in European regions.

  7. Extrovert vs. Introvert Behavior Data

    • kaggle.com
    zip
    Updated Jun 13, 2025
    Cite
    Rakesh Kapilavayi (2025). Extrovert vs. Introvert Behavior Data [Dataset]. https://www.kaggle.com/datasets/rakeshkapilavai/extrovert-vs-introvert-behavior-data/discussion
    Available download formats
    zip (31277 bytes)
    Dataset updated
    Jun 13, 2025
    Authors
    Rakesh Kapilavayi
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Overview

    Dive into the Extrovert vs. Introvert Personality Traits Dataset, a rich collection of behavioral and social data designed to explore the spectrum of human personality. This dataset captures key indicators of extroversion and introversion, making it a valuable resource for psychologists, data scientists, and researchers studying social behavior, personality prediction, or data preprocessing techniques.

    Context

    Personality traits like extroversion and introversion shape how individuals interact with their social environments. This dataset provides insights into behaviors such as time spent alone, social event attendance, and social media engagement, enabling applications in psychology, sociology, marketing, and machine learning. Whether you're predicting personality types or analyzing social patterns, this dataset is your gateway to uncovering fascinating insights.

    Dataset Details

    Size: The dataset contains 2,900 rows and 8 columns.

    Features:

      - Time_spent_Alone: Hours spent alone daily (0–11).
      - Stage_fear: Presence of stage fright (Yes/No).
      - Social_event_attendance: Frequency of social events (0–10).
      - Going_outside: Frequency of going outside (0–7).
      - Drained_after_socializing: Feeling drained after socializing (Yes/No).
      - Friends_circle_size: Number of close friends (0–15).
      - Post_frequency: Social media post frequency (0–10).
      - Personality: Target variable (Extrovert/Introvert).

    Data Quality: Includes some missing values, ideal for practicing imputation and preprocessing.

    Format: Single CSV file, compatible with Python, R, and other tools.

    Data Quality Notes

    • Contains missing values in columns like Time_spent_Alone and Going_outside, offering opportunities for data cleaning practice.
    • Balanced classes ensure robust model training.
    • Binary categorical variables simplify encoding tasks.

    Potential Use Cases

    • Build machine learning models to predict personality types.
    • Analyze correlations between social behaviors and personality traits.
    • Explore social media engagement patterns.
    • Practice data preprocessing techniques like imputation and encoding.
    • Create visualizations to uncover behavioral trends.
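
    A first preprocessing step for a file like this is usually to count missing entries per column. The sketch below assumes missing values appear as empty fields in the CSV, which the description does not confirm; the column names come from the feature list, and the rows are hypothetical.

```python
import csv
import io

# Hypothetical excerpt; column names come from the feature list above,
# and the empty-field encoding of missing values is an assumption.
sample = """Time_spent_Alone,Stage_fear,Going_outside,Personality
4,No,6,Extrovert
,Yes,,Introvert
9,Yes,2,Introvert
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Count empty fields per column.
missing_counts = {
    col: sum(1 for r in rows if r[col] == "")
    for col in rows[0]
}
print(missing_counts)
```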

  8. Data from: Biological traits of seabirds predict extinction risk and...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Mar 16, 2021
    Cite
    Cerren Richards; Robert Cooke; Amanda Bates (2021). Biological traits of seabirds predict extinction risk and vulnerability to anthropogenic threats [Dataset]. http://doi.org/10.5061/dryad.x69p8czhd
    Available download formats
    zip
    Dataset updated
    Mar 16, 2021
    Dataset provided by
    Memorial University of Newfoundland
    University of Gothenburg
    Authors
    Cerren Richards; Robert Cooke; Amanda Bates
    License

    CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)

    Description

    Aim

    Seabirds are heavily threatened by anthropogenic activities and their conservation status is deteriorating rapidly. Yet, these pressures are unlikely to uniformly impact all species. It remains an open question if seabirds with similar ecological roles are responding similarly to human pressures. Here we aim to: 1) test whether threatened vs non-threatened seabirds are separated in trait space; 2) quantify the similarity of species’ roles (redundancy) per IUCN Red List Category; and 3) identify traits that render species vulnerable to anthropogenic threats.

    Location

    Global

    Time period

    Contemporary

    Major taxa studied

    Seabirds

    Methods

    We compile and impute eight traits that relate to species’ vulnerabilities and ecosystem functioning across 341 seabird species. Using these traits, we build a mixed-data PCA of species’ trait space. We quantify trait redundancy using the unique trait combinations (UTCs) approach. Finally, we employ a SIMPER analysis to identify which traits explain the greatest difference between threat groups.

    Results

    We find seabirds segregate in trait space based on threat status, indicating anthropogenic impacts are selectively removing large, long-lived, pelagic surface feeders with narrow habitat breadths. We further find that threatened species have higher trait redundancy, while non-threatened species have relatively limited redundancy. Finally, we find that species with narrow habitat breadths, fast reproductive speeds, and varied diets are more likely to be threatened by habitat-modifying processes (e.g., pollution and natural system modifications); whereas pelagic specialists with slow reproductive speeds and varied diets are vulnerable to threats that directly impact survival and fecundity (e.g., invasive species and biological resource use) and climate change. Species with no threats are non-pelagic specialists with invertebrate diets and fast reproductive speeds.

    Main conclusions

    Our results suggest both threatened and non-threatened species contribute unique ecological strategies. Consequently, conserving both threat groups, but with contrasting approaches may avoid potential changes in ecosystem functioning and stability.

    Methods

    Trait Selection and Data

    We compiled data from multiple databases for eight traits across all 341 extant species of seabirds. Here we recognise seabirds as those that feed at sea, either nearshore or offshore, but excluding marine ducks. These traits encompass the varying ecological and life history strategies of seabirds, and relate to ecosystem functioning and species’ vulnerabilities. We first extracted the trait data for body mass, clutch size, habitat breadth and diet guild from a recently compiled trait database for birds (Cooke, Bates, et al., 2019). Generation length and migration status were compiled from BirdLife International (datazone.birdlife.org), and pelagic specialism and foraging guild from Wilman et al. (2014). We further compiled clutch size information for 84 species through a literature search.

    Foraging and diet guild describe the most dominant foraging strategy and diet of the species. Wilman et al. (2014) assigned species a score from 0 to 100% for each foraging and diet guild based on their relative usage of a given category. Using these scores, species were classified into four foraging guild categories (diver, surface, ground, and generalist foragers) and three diet guild categories (omnivore, invertebrate, and vertebrate & scavenger diets). Each was assigned to a guild based on the predominant foraging strategy or diet (score > 50%). Species with category scores < 50% were classified as generalists for the foraging guild trait and omnivores for the diet guild trait. Body mass was measured in grams and was the median across multiple databases. Habitat breadth is the number of habitats listed as suitable by the International Union for Conservation of Nature (IUCN, iucnredlist.org). Generation length describes the mean age in years at which a species produces offspring. Clutch size is the number of eggs per clutch (the central tendency was recorded as the mean or mode). Migration status describes whether a species undertakes full migration (regular or seasonal cyclical movements beyond the breeding range, with predictable timing and destinations) or not. Pelagic specialism describes whether foraging is predominantly pelagic. To improve normality of the data, continuous traits, except clutch size, were log10 transformed.
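
    The guild assignment rule described above can be sketched as follows. The function name and the example scores are hypothetical. The text does not say how a score of exactly 50% is handled, so this sketch treats it as falling back to the generalist or omnivore class.

```python
def assign_guild(scores, fallback):
    """Return the category scoring above 50%, else `fallback`.

    `scores` maps category name -> percentage use (0-100).
    """
    best = max(scores, key=scores.get)
    return best if scores[best] > 50 else fallback


# Hypothetical percentage scores for two species.
print(assign_guild({"diver": 70, "surface": 20, "ground": 10}, "generalist"))
print(assign_guild({"invertebrate": 40, "vertebrate & scavenger": 40, "plant": 20}, "omnivore"))
```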

    Multiple Imputation

    All traits had more than 80% coverage for our list of 341 seabird species, and body mass and habitat breadth had complete species coverage. To achieve complete species trait coverage, we imputed missing data for clutch size (4 species), generation length (1 species), diet guild (60 species), foraging guild (60 species), pelagic specialism (60 species) and migration status (3 species). The imputation approach has the advantage of increasing the sample size and consequently the statistical power of any analysis whilst reducing bias and error (Kim, Blomberg, & Pandolfi, 2018; Penone et al., 2014; Taugourdeau, Villerd, Plantureux, Huguenin-Elie, & Amiaud, 2014).

    We estimated missing values using random forest regression trees, a non-parametric imputation method, based on the ecological and phylogenetic relationships between species (Breiman, 2001; Stekhoven & Bühlmann, 2012). This method has high predictive accuracy and the capacity to deal with complexity in relationships including non-linearities and interactions (Cutler et al., 2007). To perform the random forest multiple imputations, we used the missForest function from package “missForest” (Stekhoven & Bühlmann, 2012). We imputed missing values based on the ecological (the trait data) and phylogenetic (the first 10 phylogenetic eigenvectors, detailed below) relationships between species. We generated 1,000 trees - a cautiously large number to increase predictive accuracy and prevent overfitting (Stekhoven & Bühlmann, 2012). We set the number of variables randomly sampled at each split (mtry) as the square-root of the number of variables included (10 phylogenetic eigenvectors, 8 traits; mtry = 4); a useful compromise between imputation error and computation time (Stekhoven & Bühlmann, 2012). We used a maximum of 20 iterations (maxiter = 20), to ensure the imputations finished due to the stopping criterion and not due to the limit of iterations (the imputed datasets generally finished after 4 – 10 iterations).
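
    The mtry value quoted above follows from the square-root rule: 10 eigenvectors plus 8 traits give 18 predictors, and the square root of 18, rounded down, is 4. A short check of that arithmetic in Python (rounding down is an assumption consistent with the reported mtry = 4; the variable names are illustrative):

```python
import math

n_eigenvectors = 10  # phylogenetic eigenvectors used as predictors
n_traits = 8         # trait variables

# Square-root rule, rounded down to an integer.
mtry = math.floor(math.sqrt(n_eigenvectors + n_traits))
print(mtry)  # 4
```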

    Due to the stochastic nature of the regression tree imputation approach, the estimated values will differ slightly each time. To capture this imputation uncertainty and to converge on a reliable result, we repeated the process 15 times, resulting in 15 trait datasets, which is suggested to be sufficient (González-Suárez, Zanchetta Ferreira, & Grilo, 2018; van Buuren & Groothuis-Oudshoorn, 2011). We took the mean values for continuous traits and modal values for categorical traits across the 15 datasets for subsequent analyses.
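
    The pooling step described above (mean for continuous traits, mode for categorical traits) can be sketched as follows. The function names and the three example runs are hypothetical; the study pooled 15 runs. How the authors resolved ties between categories is not stated, so this sketch simply keeps the category seen first.

```python
from collections import Counter
from statistics import mean


def pool_continuous(values):
    """Average one species' imputed values across the repeated runs."""
    return mean(values)


def pool_categorical(values):
    """Take the most frequent imputed category across the repeated runs."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical imputations for one species from three runs (the study used 15).
clutch_size_runs = [1.0, 2.0, 3.0]
diet_guild_runs = ["omnivore", "invertebrate", "omnivore"]

print(pool_continuous(clutch_size_runs))
print(pool_categorical(diet_guild_runs))
```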

    Phylogenetic data can improve the estimation of missing trait values in the imputation process (Kim et al., 2018; Swenson, 2014), because closely related species tend to be more similar to each other (Pagel, 1999) and many traits display high degrees of phylogenetic signal (Blomberg, Garland, & Ives, 2003). Phylogenetic information was summarised by eigenvectors extracted from a principal coordinate analysis, representing the variation in the phylogenetic distances among species (Jose Alexandre F. Diniz-Filho et al., 2012; José Alexandre Felizola Diniz-Filho, Rangel, Santos, & Bini, 2012). Bird phylogenetic distance data (Prum et al., 2015) were decomposed into a set of orthogonal phylogenetic eigenvectors using the Phylo2DirectedGraph and PEM.build functions from the “MPSEM” package (Guenard & Legendre, 2018). Here, we used the first 10 phylogenetic eigenvectors, which have previously been shown to minimise imputation error (Penone et al., 2014). These phylogenetic eigenvectors summarise major phylogenetic differences between species (Diniz-Filho et al., 2012) and captured 61% of the variation in the phylogenetic distances among seabirds. Still, these eigenvectors do not include fine-scale differences between species (Diniz-Filho et al., 2012), however the inclusion of many phylogenetic eigenvectors would dilute the ecological information contained in the traits, and could lead to excessive noise (Diniz-Filho et al., 2012; Peres‐Neto & Legendre, 2010). Thus, including the first 10 phylogenetic eigenvectors reduces imputation error and ensures a balance between including detailed phylogenetic information and diluting the information contained in the other traits.

    To quantify the average error in random forest predictions across the imputed datasets (out-of-bag error), we calculated the mean normalized root squared error and associated standard deviation across the 15 datasets for continuous traits (clutch size = 13.3 ± 0.35 %, generation length = 0.6 ± 0.02 %). For categorical data, we quantified the mean percentage of traits falsely classified (diet guild = 28.6 ± 0.97 %, foraging guild = 18.0 ± 1.05 %, pelagic specialism = 11.2 ± 0.66 %, migration status = 18.8 ± 0.58 %). Since body mass and habitat breadth have complete trait coverage, they did not require imputation. Low imputation accuracy is reflected in high out-of-bag error values where diet guild had the lowest imputation accuracy with 28.6% wrongly classified on average. Diet is generally difficult to predict (Gainsbury, Tallowin, & Meiri, 2018), potentially due to species’ high dietary plasticity (Gaglio, Cook, McInnes, Sherley, & Ryan, 2018) and/or the low phylogenetic conservatism of diet (Gainsbury et al., 2018). With this caveat in mind, we chose dietary guild, as more coarse dietary classifications are more

  9. Diabetes Risk & Lifestyle Factors

    • kaggle.com
    zip
    Updated Nov 18, 2025
    Cite
    Arif Miah (2025). Diabetes Risk & Lifestyle Factors [Dataset]. https://www.kaggle.com/datasets/miadul/diabetes-risk-and-lifestyle-factors
    Available download formats
    zip (76925 bytes)
    Dataset updated
    Nov 18, 2025
    Authors
    Arif Miah
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    A fully synthetic but realistic dataset created for Machine Learning practice, focusing on how lifestyle, health habits, and medical indicators influence the risk of diabetes.

    This dataset is ideal for:

    • Classification (Diabetes Yes/No)
    • EDA & Visualization
    • Feature Engineering
    • Missing Value Handling
    • Model Benchmarking
    • ML Pipeline Practice (Beginner → Advanced)

    📘 Dataset Overview

    This dataset simulates 5000 individuals with various health and lifestyle attributes. Each record represents one person and includes both clinical metrics (Glucose, BMI, Insulin) and behavioral factors (Diet, Exercise, Smoking, etc.).

    The target variable is Diabetes_Status (0 = No Diabetes, 1 = Diabetes).

    Missing values (~5%) are intentionally added to practice data cleaning, imputation, and analysis.

    📊 Columns Description

    🔹 Medical Indicators

    Column | Description
    Glucose | Blood glucose level (mg/dL), influenced by diet and heredity
    BMI | Body Mass Index, realistically correlated with diet & exercise
    Insulin | Insulin level (µU/mL), correlated with glucose
    Age | Age of the individual (18–80 years)

    🔹 Demographics

    Column | Description
    Gender | Male or Female

    🔹 Lifestyle Factors

    Column | Description
    Diet_Type | Healthy / Moderate / Unhealthy
    Exercise_Frequency | Daily, 3–5 / week, 1–2 / week, Rarely
    Heredity | Family history of diabetes (Yes/No)
    Smoking | Smoking habit (Yes/No)
    Alcohol | None / Low / Moderate / High
    Stress_Score | Stress rating from 1 to 10
    Sleep_Hours | Average daily sleep duration

    🎯 Target Variable

    Column | Description
    Diabetes_Status | 0 = Non-diabetic, 1 = Diabetic

    🎯 How Diabetes Status Was Generated (Synthetic Logic)

    The probability of diabetes is influenced by:

    • Higher Glucose & BMI
    • Unhealthy diet
    • Low exercise frequency
    • Family history (heredity)
    • High stress
    • Lifestyle habits (smoking, alcohol)

    This creates realistic correlations suitable for ML models.

    🧩 Missing Values

    Approximately 5% missing values are added randomly across all columns to help learners practice:

    • Imputation (Mean/Median/Mode/KNN)
    • Handling missing categorical & numerical data
    • Impact on model training
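
    The simpler strategies in this list (mean, median, mode) can be sketched with the standard library alone. The `impute` helper and the sample values are hypothetical; the column names follow the dataset description, and missing entries are assumed to be represented as None. KNN imputation is omitted because it needs a distance measure across several columns.

```python
from statistics import mean, median, mode


def impute(values, strategy):
    """Fill None entries using a statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical values; column names follow the dataset description.
glucose = [90.0, None, 110.0, 130.0]
diet_type = ["Healthy", "Unhealthy", None, "Healthy"]

print(impute(glucose, "mean"))
print(impute(glucose, "median"))
print(impute(diet_type, "mode"))
```

    Mean and median suit numerical columns such as Glucose, while mode is the natural choice for categorical columns such as Diet_Type.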

    🧠 Ideal ML Use-Cases

    You can use this dataset for:

    ✔ Binary Classification Models

    • Logistic Regression
    • Random Forest
    • XGBoost
    • Neural Networks
    • SVM

    ✔ EDA + Data Visualization

    • Correlation heatmaps
    • Lifestyle vs. Diabetes analysis
    • Glucose/BMI distribution

    ✔ Data Preprocessing

    • Handling missing values
    • Scaling
    • Encoding categorical features

    ✔ Feature Engineering

    • Risk scoring
    • Interaction features
    • Normalization

    💡 Why This Dataset?

    Many diabetes datasets (e.g., Pima) are small and limited. This dataset provides:

    • More features
    • More samples (5000)
    • Better ML exploration
    • Realistic relationships
    • Modern lifestyle variables

    Perfect for students, beginners, and ML practitioners.

    📥 Source

    This dataset is fully synthetic and AI-generated, created exclusively for educational and machine learning purposes.

  10. Panel_democ_stability_growth_MENA_Over_1983_2022

    • data.mendeley.com
    Updated Jun 23, 2023
    Cite
    Brahim Zirari (2023). Panel_democ_stability_growth_MENA_Over_1983_2022 [Dataset]. http://doi.org/10.17632/vhh9cg2wzt.3
    Dataset updated
    Jun 23, 2023
    Authors
    Brahim Zirari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This panel dataset presents information on the impact of democracy and political stability on economic growth in 15 MENA countries for the period 1983-2022. The data are collected from five sources: the World Bank Development Indicators (WDI), the World Bank Governance Indicators (WGI), the Penn World Table (PWT), Polity5 from the Integrated Network for Societal Conflict Research (INSCR), and the Varieties of Democracy (V-Dem). The dataset includes ten variables related to economic growth, democracy, and political stability. Data analysis was performed using statistical software such as R, and missing data were imputed to improve data reliability, enabling future researchers to explore the impact of political factors on growth in various contexts. The data are presented in two sheets, before and after imputation of missing values. The dataset can be reused to examine the impact of different political factors on economic growth in the region.

  11. A Hybrid Educational Dataset

    • kaggle.com
    Updated Jun 27, 2025
    Cite
    Emanoel Carvalho Lopes (2025). A Hybrid Educational Dataset [Dataset]. https://www.kaggle.com/datasets/emanoelcarvalholopes/uci-oulad-sintetico-unificados
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Emanoel Carvalho Lopes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    The early identification of students facing learning difficulties is one of the most critical challenges in modern education. Intervening effectively requires leveraging data to understand the complex interplay between student demographics, engagement patterns, and academic performance.

    This dataset was created to serve as a high-quality, pre-processed resource for building machine learning models to tackle this very problem. It is a unique hybrid dataset, meticulously crafted by unifying three distinct sources:

    • The Open University Learning Analytics Dataset (OULAD): A rich dataset detailing student interactions with a Virtual Learning Environment (VLE). We have aggregated the raw, granular data (over 10 million interaction logs) into powerful features, such as total clicks, average assessment scores, and distinct days of activity for each student registration.

    • The UCI Student Performance Dataset: A classic educational dataset containing demographic information and final grades in Portuguese and Math subjects from two Portuguese schools.

    • A Synthetic Data Component: A synthetically generated portion of the data, created to balance the dataset or represent specific student profiles.

    Data Unification and Pre-processing

    A direct merge of these sources was not possible as the student identifiers were not shared. Instead, a strategy of intelligent concatenation was employed. The final dataset has undergone a rigorous pre-processing pipeline to make it immediately usable for machine learning tasks:

    • Advanced Imputation: Missing values were handled using a sophisticated iterative imputation method powered by Gaussian Mixture Models (GMM), ensuring the dataset's integrity.

    • One-Hot Encoding: All categorical features have been converted to a numerical format.

    • Feature Scaling: All numerical features have been standardized (using StandardScaler) to have a mean of 0 and a standard deviation of 1, preventing model bias from features with different scales.
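    The encoding and scaling steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the author's pipeline: the function names are hypothetical, the GMM-based imputation step is not shown, and the population standard deviation is used, which matches the behaviour of scikit-learn's StandardScaler.

```python
import statistics


def standardize(values):
    """Scale values to mean 0 and standard deviation 1 (population formula)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # A constant column (std == 0) would need special handling; not covered here.
    return [(v - mean) / std for v in values]


def one_hot(values, prefix):
    """Turn one categorical column into 0/1 indicator columns."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


# Hypothetical inputs for illustration.
scaled = standardize([1.0, 2.0, 3.0])
encoded = one_hot(["F", "M", "F"], "gender")
```

    The `gender_*` style of the resulting keys mirrors the column naming described for this dataset.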

    The result is a clean, comprehensive dataset ready for modeling.

    File Information

    Instance

    Each row represents a student profile, and the columns are the features and the target.

    Feature

    Features include aggregated online engagement metrics (e.g., clicks, distinct activities), academic performance (grades, scores), and student demographics (e.g., gender, age band). A key feature indicates the original data source (OULAD, UCI, Synthetic).

    Sensitive Information

    The dataset contains no Personally Identifiable Information (PII). Demographic information is presented in broad, anonymized categories.

    Key Columns:

    Target Variable:
    
      had_difficulty: The primary target for classification. This binary variable has been engineered from the original final_result column of the OULAD dataset.
    
        1: The student either failed (Fail) or withdrew (Withdrawn) from the course.
    
        0: The student passed (Pass or Distinction).
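    The target definition above amounts to a small lookup. A sketch, assuming the four final_result labels named in the description are the only values present (the function name is hypothetical):

```python
def had_difficulty(final_result):
    """Map an OULAD final_result label to the binary target."""
    mapping = {
        "Fail": 1,         # failed the course
        "Withdrawn": 1,    # withdrew from the course
        "Pass": 0,
        "Distinction": 0,
    }
    # A KeyError signals a label outside the four documented values.
    return mapping[final_result]
```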
    
    Feature Groups:
    
      OULAD Aggregated Features (e.g., oulad_total_cliques, oulad_media_notas): Quantitative metrics summarizing a student's engagement and performance within the VLE.
    
      Academic Performance Features (e.g., nota_matematica_harmonizada): Harmonized grades from different data sources.
    
      Demographic Features (e.g., gender_*, age_band_*): One-hot encoded columns representing student demographics.
    
      Origin Features (e.g., origem_dado_OULAD, origem_dado_UCI): One-hot encoded columns indicating the original source of the data for each row. This allows for source-specific analysis.
    

    (Note: All numerical feature names are post-scaling and may not directly reflect their original names. Please refer to the complete column list for details.)

    Acknowledgements

    This dataset would not be possible without the original data providers. Please acknowledge them in any work that uses this data:

    OULAD Dataset: Kuzilek, J., Hlosta, M., and Zdrahal, Z. (2017). Open University Learning Analytics dataset. Scientific Data, 4. https://analyse.kmi.open.ac.uk/open_dataset
    
    UCI Student Performance Dataset: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS. https://archive.ics.uci.edu/ml/datasets/student+performance
    

    Inspiration

    This dataset is perfect for a variety of predictive modeling tasks. Here are a few ideas to get you started:

    • Can you build a classification model to predict had_difficulty with high recall (minimizing the number of at-risk students we fail to identify)?
    
    • Which features are the most powerful predictors of student failure or withdrawal? (Feature Importance Analysis).

    • Can you build separate models for each data origin (origem_dado_*) and compare ...

  12. Data from: Computational methods to simultaneously compare the predictive...

    • tandf.figshare.com
    txt
    Updated May 31, 2023
    Cite
    J. A. Roldán-Nofuentes (2023). Computational methods to simultaneously compare the predictive values of two diagnostic tests with missing data: EM-SEM algorithms and multiple imputation [Dataset]. http://doi.org/10.6084/m9.figshare.14602706.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    J. A. Roldán-Nofuentes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Predictive values are measures of the clinical accuracy of a binary diagnostic test, and depend on the sensitivity and the specificity of the diagnostic test and on the disease prevalence among the population being studied. This article studies hypothesis tests to simultaneously compare the predictive values of two binary diagnostic tests in the presence of missing data. The hypothesis tests were solved applying two computational methods: the expectation maximization and the supplemented expectation maximization algorithms, and multiple imputation. Simulation experiments were carried out to study the sizes and the powers of the hypothesis tests, giving some general rules of application. Two R programmes were written to apply each method, and they are available as supplementary material for the manuscript. The results were applied to the diagnosis of Alzheimer’s disease.
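    The dependence of predictive values on sensitivity, specificity, and prevalence follows from Bayes' theorem. The sketch below computes both values for a single test. It illustrates the definition only and is not the EM-SEM or multiple-imputation procedure of the article. The function name is an assumption.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary diagnostic test via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)   # P(disease | positive test)
    npv = true_neg / (true_neg + false_neg)   # P(no disease | negative test)
    return ppv, npv


# Same test, two prevalences: the positive predictive value rises with prevalence.
ppv_rare, npv_rare = predictive_values(0.9, 0.8, 0.1)
ppv_common, npv_common = predictive_values(0.9, 0.8, 0.5)
```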

  13. Data from: Quantifying the impacts of management and herbicide resistance on...

    • datadryad.org
    • search.dataone.org
    zip
    Updated Nov 28, 2023
    Cite
    Robert Goodsell; David Comont; Helen Hicks; James Lambert; Richard Hull; Laura Crook; Paolo Fraccaro; Katharina Reusch; Robert Freckleton; Dylan Childs (2023). Quantifying the impacts of management and herbicide resistance on regional plant population dynamics in the face of missing data [Dataset]. http://doi.org/10.5061/dryad.9cnp5hqn5
    Available download formats: zip
    Dataset updated
    Nov 28, 2023
    Dataset provided by
    Dryad
    Authors
    Robert Goodsell; David Comont; Helen Hicks; James Lambert; Richard Hull; Laura Crook; Paolo Fraccaro; Katharina Reusch; Robert Freckleton; Dylan Childs
    Time period covered
    Nov 17, 2023
    Description

    Data were collected from a network of UK farms using a density structured survey method outlined in Queensborough 2011.

  14. Data from: proteiNorm – A User-Friendly Tool for Normalization and Analysis...

    • datasetcatalog.nlm.nih.gov
    Updated Sep 30, 2020
    Cite
    Byrd, Alicia K; Zafar, Maroof K; Graw, Stefan; Tang, Jillian; Byrum, Stephanie D; Peterson, Eric C.; Bolden, Chris (2020). proteiNorm – A User-Friendly Tool for Normalization and Analysis of TMT and Label-Free Protein Quantification [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000568582
    Dataset updated
    Sep 30, 2020
    Authors
    Byrd, Alicia K; Zafar, Maroof K; Graw, Stefan; Tang, Jillian; Byrum, Stephanie D; Peterson, Eric C.; Bolden, Chris
    Description

    The technological advances in mass spectrometry allow us to collect more comprehensive data with higher quality and increasing speed. With the rapidly increasing amount of data generated, the need for streamlining analyses becomes more apparent. Proteomics data are often affected by systemic bias from unknown sources, and failing to adequately normalize the data can lead to erroneous conclusions. To allow researchers to easily evaluate and compare different normalization methods via a user-friendly interface, we have developed “proteiNorm”. The current implementation of proteiNorm accommodates preliminary filters on peptide and sample levels, followed by an evaluation of several popular normalization methods and visualization of missing values. The user then selects an adequate normalization method and one of several imputation methods, which are used for the subsequent comparison of different differential expression methods and estimation of statistical power. The application of proteiNorm and interpretation of its results are demonstrated on two tandem mass tag multiplex (TMT6plex and TMT10plex) and one label-free spike-in mass spectrometry example data set. The three data sets reveal how the normalization methods perform differently on different experimental designs and the need for evaluation of normalization methods for each mass spectrometry experiment. With proteiNorm, we provide a user-friendly tool to identify an adequate normalization method and to select an appropriate method for differential expression analysis.

  15. National Panel Survey 2010-2011 - United Republic of Tanzania

    • microdata.fao.org
    Updated Nov 8, 2022
    Cite
    National Bureau of Statistics (2022). National Panel Survey 2010-2011 - United Republic of Tanzania [Dataset]. https://microdata.fao.org/index.php/catalog/study/TZA_2010-2011_NPS-W2_v01_EN_M_v01_A_OCS
    Dataset updated
    Nov 8, 2022
    Dataset authored and provided by
    National Bureau of Statistics
    Time period covered
    2010 - 2011
    Area covered
    Tanzania
    Description

    Abstract

    The main objective of the Tanzania NPS is to provide high-quality household-level data to the Tanzanian government and other stakeholders for monitoring poverty dynamics, tracking the progress of the Mkukuta poverty reduction strategy, and evaluating the impact of other major national-level government policy initiatives. As an integrated survey covering a number of different socioeconomic factors, it complements other more narrowly focused survey efforts, such as the Demographic and Health Survey on health, the Integrated Labour Force Survey on labour markets, the Household Budget Survey on expenditure, and the National Sample Census of Agriculture. Secondly, as a panel household survey in which the same households are revisited over time, the Tanzania NPS allows for the study of poverty and welfare transitions and the determinants of living standard changes.

    Geographic coverage

    National

    Analysis unit

    Households

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample design for the second round of the NPS revisits all the households interviewed in the first round of the panel, as well as tracking adult split-off household members. The original sample size of 3,265 households was designed to be representative at the national, urban/rural, and major agro-ecological zone levels. The total sample size was 3,265 households in 409 Enumeration Areas (2,063 households in rural areas and 1,202 in urban areas). It is also possible in the final analysis to produce disaggregated poverty rates for 4 different strata: Dar es Salaam, other urban areas on mainland Tanzania, rural mainland Tanzania, and Zanzibar.

    Since the NPS is a panel survey, the second round of the fieldwork revisits all households originally interviewed during round one. If a household had moved from its original location, the members were interviewed in their new location. If that location was within one hour of the original location, the field team did the interview at the time of their visit to the enumeration area. If the household had relocated more than an hour from the original location, details of the new location were recorded on specialized forms, and the information was passed to a dedicated tracking team for follow-up.

    If a member of the original household had split from their original location to form or join a new household, information was recorded on the current whereabouts of this member. All adult former household members (those over the age of 15) were tracked to their new location. Similar to the protocol for the relocated households, if the new household was within one hour of the original location, it was interviewed by the main field team at the time of the visit to the enumeration area. For those who had moved more than one hour away, their information was passed to the dedicated tracking team for follow-up. Once the tracking targets had been found, teams were required to interview them and any new members of the household.

    The second round of the NPS has a total sample size of 3,924 households. This includes 3,168 round-one households, a re-interview rate of over 97 percent. In addition, of the 10,420 eligible adults (over age 15 in 2010), 9,338 were re-interviewed, a re-interview rate of approximately 90 percent.
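    The quoted re-interview rates can be checked with simple arithmetic. The sketch below assumes that the household rate is taken relative to the 3,265 households of the original round-one sample mentioned earlier in this description. The function name is hypothetical.

```python
def reinterview_rate(reinterviewed, eligible):
    """Percentage of eligible units that were re-interviewed."""
    return 100.0 * reinterviewed / eligible


# Assumed denominator: the 3,265 households of the round-one sample.
household_rate = reinterview_rate(3168, 3265)
# 9,338 of the 10,420 eligible adults were re-interviewed.
adult_rate = reinterview_rate(9338, 10420)

print(f"households: {household_rate:.1f}%, adults: {adult_rate:.1f}%")
# prints "households: 97.0%, adults: 89.6%"
```

    Both results are consistent with the "over 97 percent" and "approximately 90 percent" figures in the text.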

    Sampling deviation

    The second round of the NPS has a total sample size of 3,924 households. This includes 3,168 round-one households, a re-interview rate of over 97 percent. In addition, of the 10,420 eligible adults (over age 15 in 2010), 9,338 were re-interviewed, a re-interview rate of approximately 90 percent. To obtain the attrition adjustment factor, the probability that a sample household was successfully re-interviewed in the second round of surveys is modelled with a linear logistic model at the level of the individual. A binary response variable is created by coding the response disposition as 0 for eligible households that do not respond in the second round and as 1 for households that do respond.

    Mode of data collection

    Face-to-face [f2f]

    Cleaning operations

    A CSPro-based data entry/editing system was used. The values entered in the field-based data entry were cross-compared with those from the double entry, and any differences between the two were flagged for manual inspection of the physical questionnaire. Corrections based on this inspection exercise were ultimately encoded in the dataset.

    Additionally, an extensive review of data files was conducted, including interviewer errors such as missing values, ranges and outliers. Observations were returned for manual inspection of the physical questionnaires if continuous values fell outside five standard deviations of the mean, categorical values were not eligible responses, or there were internal inconsistencies within the dataset (for example, the age of an individual was not consistent with their educational status, there was more than one head of household listed, an individual was engaged in multiple primary activities, the quantity of crops and their by-products produced, harvested, and sold not listed, the distance from the market and an individual's plot was not listed, the number of weeks, days per week, and hours per day an individual engaged in fishery activity was not recorded, the species and quantity of fish caught, bought, sold, or traded was not listed, etc). When it was determined that these values were the result of data-entry error, the values were corrected. In addition, cases deemed to reflect obvious enumerator error were also corrected in this cleaning process. The majority of such cases involved the use of incorrect measurement units, e.g. recording grams as kilograms or vice versa.
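    The five-standard-deviation rule for continuous values can be sketched as follows. The source does not say how the mean and standard deviation were computed, so the use of the population standard deviation over the whole column is an assumption, as are the function name and the sample data.

```python
import statistics


def flag_outliers(values, threshold=5.0):
    """Return indices of values more than `threshold` SDs from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - mean) > threshold * std]


# Hypothetical harvest quantities in kg, with one entry recorded in grams.
quantities = [10.0] * 99 + [1000.0]
to_inspect = flag_outliers(quantities)
```

    Flagged indices would then be returned for manual inspection against the physical questionnaire, as described above, rather than corrected automatically.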

    Response rate

    Approximately 95 percent

    Sampling error estimates

    To reduce the overall standard errors and weight the population totals up to the known population figures, a post-stratification correction is applied. Adjustment factors are calculated based on the projected number of households in the urban and rural segments of each region.
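    One common form of post-stratification scales the survey weights in each stratum so that they sum to the projected total. The sketch below assumes that form; the exact formula used for the NPS is not given here, and the function name and figures are hypothetical.

```python
def poststratification_factors(weighted_totals, projected_totals):
    """Per-stratum factor = projected households / weighted sample total."""
    return {
        stratum: projected_totals[stratum] / weighted_totals[stratum]
        for stratum in weighted_totals
    }


# Hypothetical figures for the urban and rural segments of one region.
weighted = {"urban": 900.0, "rural": 2100.0}
projected = {"urban": 1000.0, "rural": 2000.0}
factors = poststratification_factors(weighted, projected)
```

    Multiplying each household weight by the factor for its stratum makes the weighted totals match the projections.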

    Data appraisal

    The estimated logistic model is used to obtain a predicted probability of response for each household member in the 2010/2011 survey. These response probabilities were then aggregated to the household level (by calculating the mean). Then, using the household-level predicted response probabilities as the ranking variable, all households were ranked into 10 equal groups (deciles). An attrition adjustment factor was then defined as the reciprocal of the empirical response rate for the household-level propensity
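    The steps in this paragraph (household mean, ranking into deciles, reciprocal of the empirical response rate) can be sketched as below. The predicted probabilities are taken as given, since fitting the logistic model is outside this sketch. The function names and the tuple layout are assumptions.

```python
from collections import defaultdict


def household_probability(member_probabilities):
    """Aggregate member-level response probabilities by their mean."""
    return sum(member_probabilities) / len(member_probabilities)


def attrition_factors(households, n_groups=10):
    """households: (household_id, predicted_probability, responded) tuples.

    Returns {household_id: factor} for responding households only.
    """
    ranked = sorted(households, key=lambda h: h[1])
    groups = defaultdict(list)
    for rank, household in enumerate(ranked):
        # Assign equal-sized groups by rank (deciles when n_groups is 10).
        groups[rank * n_groups // len(ranked)].append(household)
    factors = {}
    for members in groups.values():
        response_rate = sum(h[2] for h in members) / len(members)
        for household_id, _, responded in members:
            if responded:
                # Reciprocal of the empirical response rate in this group.
                factors[household_id] = 1.0 / response_rate
    return factors
```

    Only responding households receive a factor, because non-respondents contribute no data to reweight.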

    Then a logistic response propensity model is fitted, using 2005 UNHS household and individual characteristics measured in the first wave as covariates. In a few limited cases, values of unit level variables were missing from the 2008/2009 household dataset. These values were imputed using multivariate regression and logistic regression techniques. Imputations are done using the 'impute' command in Stata at the level of the UNPS strata (urban/rural and region). Overall, less than one percent of the variables required imputation to replace missing values.

  16. Historical global map of NH4+ and NO3- application in synthetic nitrogen...

    • service.tib.eu
    Updated Nov 30, 2024
    Cite
    (2024). Historical global map of NH4+ and NO3- application in synthetic nitrogen fertilizer, link to NetCDF files - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/png-doi-10-1594-pangaea-861203
    Dataset updated
    Nov 30, 2024
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This paper provides a method for constructing a new historical global nitrogen fertilizer application map (0.5° × 0.5° resolution) for the period 1961-2010 based on country-specific information from Food and Agriculture Organization statistics (FAOSTAT) and various global datasets. This new map incorporates the fraction of NH4+ (and NO3-) in N fertilizer inputs by utilizing fertilizer species information in FAOSTAT, in which species can be categorized as NH4+- and/or NO3--forming N fertilizers. During data processing, we applied a statistical data imputation method for the missing data (19 % of national N fertilizer consumption) in FAOSTAT. The multiple imputation method enabled us to fill gaps in the time-series data with plausible values based on covariate information (year, population, GDP, and crop area). After the imputation, we downscaled the national consumption data to a gridded cropland map. We also applied the multiple imputation method to the available chemical fertilizer species consumption, allowing for the estimation of the NH4+/NO3- ratio in national fertilizer consumption. In this study, the synthetic N fertilizer inputs in 2000 showed a general consistency with the existing N fertilizer map (Potter et al., 2010, doi:10.1175/2009EI288.1) in relation to the ranges of N fertilizer inputs. Globally, the estimated N fertilizer inputs based on the sum of filled data increased from 15 Tg-N to 110 Tg-N during 1961-2010. On the other hand, the global NO3- input started to decline after the late 1980s, and the fraction of NO3- in global N fertilizer decreased consistently from 35 % to 13 % over a 50-year period. NH4+-based fertilizers are dominant in most countries; however, the NH4+/NO3- ratio in N fertilizer inputs shows clear differences temporally and geographically. This new map can be utilized as input data for global model studies and can bring new insights for the assessment of historical terrestrial N cycling changes.

  17. Assessing among-lineage variability in phylogenetic imputation of functional...

    • datasetcatalog.nlm.nih.gov
    • search.dataone.org
    Updated Jan 23, 2018
    Cite
    Rodriguez, Miguel Á.; Peres-Neto, Pedro R.; Moreno-Saiz, Juan Carlos; Castro, Isabel Castro; Davies, T. Jonathan; Molina-Venegas, Rafael; Molin-Venegas, Rafael (2018). Assessing among-lineage variability in phylogenetic imputation of functional trait datasets [Dataset]. http://doi.org/10.5061/dryad.12111
    Dataset updated
    Jan 23, 2018
    Authors
    Rodriguez, Miguel Á.; Peres-Neto, Pedro R.; Moreno-Saiz, Juan Carlos; Castro, Isabel Castro; Davies, T. Jonathan; Molina-Venegas, Rafael; Molin-Venegas, Rafael
    Description

    Phylogenetic imputation has recently emerged as a potentially powerful tool for predicting missing data in functional traits datasets. As such, understanding the limitations of phylogenetic modelling in predicting trait values is critical if we are to use them in subsequent analyses. Previous studies have focused on the relationship between phylogenetic signal and clade-level prediction accuracy, yet variability in prediction accuracy among individual tips of phylogenies remains largely unexplored. Here, we used simulations of trait evolution along the branches of phylogenetic trees to show how the accuracy of phylogenetic imputations is influenced by the combined effects of (1) the amount of phylogenetic signal in the traits and (2) the branch length of the tips to be imputed. Specifically, we conducted cross-validation trials to estimate the variability in prediction accuracy among individual tips on the phylogenies (hereafter “tip-level accuracy”). We found that under a Brownian motion model of evolution (BM, Pagel's λ = 1), tip-level accuracy rapidly decreased with increasing tip branch-lengths, and only tips of approximately 10% or less of the total height of the trees showed consistently accurate predictions (i.e. cross-validation R-squared > 0.75). When phylogenetic signal was weak, the effect of tip branch-length was reduced, becoming negligible for traits simulated with λ < 0.7, where accuracy was in any case low. Our study shows that variability in prediction accuracy among individual tips of the phylogeny should be considered when evaluating the reliability of phylogenetically imputed trait values. To address this challenge, we describe a Monte Carlo-based method that allows one to estimate the expected tip-level accuracy of phylogenetic predictions for continuous traits. 
    Our approach identifies gaps in functional trait datasets for which phylogenetic imputation performs poorly, and will help ecologists to design more efficient trait collection campaigns by focusing resources on lineages whose trait values are more uncertain.

  18. QSAR-DATASET-FOR-DRUG-TARGET-CHEMBL2010624

    • openml.org
    Updated Jul 16, 2016
    Cite
    Dr Jeremy Besnard; Dr Ivan Olier; Dr Noureddin Sadawi; Dr Larisa Soldatova; Dr Crina Grosan; Prof Ross King; Dr Richard Bickerton; Prof Andrew Hopkins and Dr Willem van Hoorn (2016). QSAR-DATASET-FOR-DRUG-TARGET-CHEMBL2010624 [Dataset]. https://www.openml.org/d/40430
    Dataset updated
    Jul 16, 2016
    Authors
    Dr Jeremy Besnard; Dr Ivan Olier; Dr Noureddin Sadawi; Dr Larisa Soldatova; Dr Crina Grosan; Prof Ross King; Dr Richard Bickerton; Prof Andrew Hopkins and Dr Willem van Hoorn
    Description

    This dataset contains QSAR data (from ChEMBL version 17) showing activity values (unit is pseudo-pCI50) of several compounds on drug target ChEMBL_ID: CHEMBL2010624 (TID: 104492). It has 11 rows and 294 features (not including the molecule ID and class feature: molecule_id and pXC50). The features represent molecular descriptors, which were generated from SMILES strings. Missing value imputation (using the median) and feature selection were applied to this dataset.

  19. Data from: A comparison of genomic selection models across time in interior...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 27, 2015
    Cite
    Blaise Ratcliffe; Omnia Gamal El-Dien; Jaroslav Klápště; Ilga Porth; Charles Chen; Barry Jaquish; Yousry A. El-Kassaby (2015). A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods [Dataset]. http://doi.org/10.5061/dryad.m4vh4
    Available download formats: zip
    Dataset updated
    May 27, 2015
    Dataset provided by
    Ministry of Forests
    University of British Columbia
    Authors
    Blaise Ratcliffe; Omnia Gamal El-Dien; Jaroslav Klápště; Ilga Porth; Charles Chen; Barry Jaquish; Yousry A. El-Kassaby
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    British Columbia
    Description

    Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, for whom lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3–40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA) and their marker effects. Moderate levels of PA (0.31–0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time in accordance with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04–0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models yielded equal PA, which was better than that of the generalized ridge regression heteroscedastic effect model for the traits evaluated.

  20. South African Census 2001, CASASP imputed data - South Africa

    • datafirst.uct.ac.za
    • catalog.ihsn.org
    Updated Mar 29, 2020
    Cite
    Statistics South Africa (2020). South African Census 2001, CASASP imputed data - South Africa [Dataset]. http://www.datafirst.uct.ac.za/Dataportal/index.php/catalog/227
    Dataset updated
    Mar 29, 2020
    Dataset provided by
    Statistics South Africa (http://www.statssa.gov.za/)
    Centre for the Analysis of South African Social Policy
    Time period covered
    2006
    Area covered
    South Africa
    Description

    Abstract

    This dataset includes imputation for missing data in key variables in the ten percent sample of the 2001 South African Census. Researchers at the Centre for the Analysis of South African Social Policy (CASASP) at the University of Oxford used sequential multiple regression techniques to impute income, education, age, gender, population group, occupation and employment status in the dataset. The main focus of the work was to impute income where it was missing or recorded as zero. The imputed results are similar to previous imputation work on the 2001 South African Census, including the single ‘hot-deck’ imputation carried out by Statistics South Africa.

    Kind of data

    Sample survey data [ssd]

    Mode of data collection

    Face-to-face [f2f]
